Abstract. We present a flexible rule compiler developed for a text-to-speech (TTS)
system. The compiler converts a set of rules into a finite-state transducer (FST). The 
input and output of the FST are subject to parameterization, so that the system can 
be applied to strings and sequences of feature-structures. The resulting transducer 
is guaranteed to realize a function (as opposed to a relation), and therefore can be 
implemented as a deterministic device (either a deterministic FST or a bimachine).



1 Motivation 

Implementations of TTS systems are often based on operations transform- 
ing one sequence of symbols or objects into another. Starting from the in- 
put string, the system creates a sequence of tokens which are subject to 
part-of-speech tagging, homograph disambiguation rules, lexical lookup and 
grapheme-to-phoneme conversion. The resulting phonetic transcriptions are 
also transformed by syllabification rules, post-lexical reductions, etc. 

The character of the above transformations suggests finite-state transducers
(FSTs) as a modelling framework [Sproat, 1996; Mohri, 1997]. However,
this is not always straightforward for two reasons.

Firstly, the transformations are more often expressed by rules than en- 
coded directly in finite-state networks. In order to overcome this difficulty, 
we need an adequate compiler converting the rules into an FST. 

Secondly, finite-state machines require a finite alphabet of symbols while 
it is often more adequate to encode linguistic information using structured 
representations (e.g. feature structures) the inventory of which might be po- 
tentially infinite. Thus, the compilation method must be able to reduce the 
infinite set of feature structures to a finite FST input alphabet.

In this paper, we show how these two problems have been solved in rVoice,
a speech synthesis system developed at Rhetorical Systems.



2 Definitions and Notation 



A deterministic finite-state automaton (acceptor, DFSA) over a finite alphabet
$\Sigma$ is a quintuple $A = (\Sigma, Q, q_0, \delta, F)$ such that:

$Q$ is a finite set of states, and $q_0 \in Q$ is the initial state of $A$;
$\delta : Q \times \Sigma \to Q$ is the transition function of $A$;
$F \subseteq Q$ is a non-empty set of final states.

A (non-deterministic) finite-state transducer (FST) over an input alphabet
$\Sigma$ and an output alphabet $\Delta$ is a 6-tuple
$T = (\Sigma, \Delta, Q, I, E, F)$ such that:

$Q$ is a finite set of states;

$I \subseteq Q$ is the set of initial and $F \subseteq Q$ that of final states;
$E \subseteq Q \times Q \times (\Sigma \cup \{\varepsilon\}) \times \Delta^*$ is
the set of transitions of $T$. We call a quadruple $(q, q', a, o) \in E$ a
transition from $q$ to $q'$ with input $a$ and output $o$.

Each transducer $T$ defines a relation $R_T$ on $\Sigma^* \times \Delta^*$ such
that $(s, o) \in R_T$ iff there exists a decomposition of $s$ and $o$ into
substrings $s_1, \ldots, s_t$, $o_1, \ldots, o_t$ such that
$s_1 \cdot \ldots \cdot s_t = s$, $o_1 \cdot \ldots \cdot o_t = o$ and there
exist states $q_0 \ldots q_t \in Q$, $q_0 \in I$, $q_t \in F$, such that
$(q_{i-1}, q_i, s_i, o_i) \in E$ for $i = 1 \ldots t$.

If $R_T$ is a (partial) function from $\Sigma^*$ to $\Delta^*$, the FST is called functional.

A deterministic finite-state transducer (DFST) is a DFSA whose transitions
are associated with sequences of symbols from an output alphabet $\Delta$. It
is defined as $T = (\Sigma, \Delta, Q, q_0, \delta, \sigma, F)$ such that
$(\Sigma, Q, q_0, \delta, F)$ is a DFSA and $\sigma(q, a)$ is the output
associated with the transition leaving $q$ and consuming the input symbol $a$.
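
The DFST definition can be made concrete with a minimal Python sketch. The class name `DFST`, the dictionary encoding of $\delta$ and $\sigma$, and the toy machine are assumptions made for illustration; they are not taken from rVoice.

```python
class DFST:
    """T = (Sigma, Delta, Q, q0, delta, sigma, F), with delta and sigma as dicts."""

    def __init__(self, q0, delta, sigma, finals):
        self.q0 = q0
        self.delta = delta    # (state, input symbol) -> next state
        self.sigma = sigma    # (state, input symbol) -> tuple of output symbols
        self.finals = finals  # set of final states

    def apply(self, word):
        """Return the list of output symbols, or None if the input is rejected."""
        state, out = self.q0, []
        for a in word:
            if (state, a) not in self.delta:
                return None
            out.extend(self.sigma[(state, a)])
            state = self.delta[(state, a)]
        return out if state in self.finals else None


# Hypothetical toy machine: rewrite every 'a' as the sequence x y, copy 'b'.
toy = DFST(
    q0=0,
    delta={(0, "a"): 0, (0, "b"): 0},
    sigma={(0, "a"): ("x", "y"), (0, "b"): ("b",)},
    finals={0},
)
```

Because each input symbol triggers exactly one dictionary lookup, the running time of `apply` is linear in the length of the input.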

In addition to the concepts introduced above, we will use the following
notation. If $T$, $T_1$, $T_2$ are finite-state transducers, then $T^{-1}$
denotes the result of reversing $T$. $T_1 \cdot T_2$ is the concatenation of
transducers $T_1$ and $T_2$. $T_1 \circ T_2$ denotes the composition of $T_1$
and $T_2$.

3 Requirements 

In this section, we review the state of the art in finite-state technology from 
the angle of applicability to the symbolic part of a TTS system. 

3.1 Finite-State Rule Compilers 

Many solutions have been proposed for compiling rewrite rules into FSTs, cf.
[Kaplan and Kay, 1994; Roche and Schabes, 1995; Mohri and Sproat, 1996].

Typically, a rewrite rule $\phi \to \psi / \lambda \_ \rho$ states that a string
matching a regular expression $\phi$ is rewritten as $\psi$ if it is preceded
by a left context $\lambda$ and followed by a right context $\rho$, where both
$\lambda$ and $\rho$ are stated as regular expressions over either the input
alphabet $\Sigma$ or the output alphabet $\Delta$. The compiler compiles the
rule by converting $\phi$, $\lambda$ and $\rho$ into a number of separate
transducers and then composing them into an FST that performs the rewrite
operation.

Since a rule may overlap or conflict with other rules, a disambiguation 
strategy is required. There are several possibilities. Firstly, if the rules are 
associated with probabilities or scores, these numeric values may be added 
to transitions in the form of weights, thus defining a weighted finite-state 
transducer (WFST). Such a WFST is not determinizable in general, but the 
weights may be used to guide the search for the best solution and constrain 
the search space. 

Secondly, a deterministic longest-match strategy may be pursued. Finally,
we may regard the order of the rules as meaningful in the sense of priorities:
if a rule $R_k$ rewrites a string $s$ that matches its focus $\phi_k$, it blocks
the application of all rules $R_i$ such that $i > k$ to any string overlapping
with $s$.

In our research, we have focused on the third strategy as the most
appropriate one in the context of our TTS system and the available resources.
This choice makes determinizability a particularly desirable feature of the
rule FSTs as it guarantees linear-time processing of input. Although a
transducer implementing rules with unlimited regular expressions in the left
and the right context is not determinizable in general [Poibeau, 2001],
deterministic processing is still possible by means of a bimachine, i.e., an
aggregate of a left-to-right and a right-to-left DFSA [Berstel, 1979]. For
this, the resulting rule FST must realize a function.

Unfortunately, the compilers described by [Kaplan and Kay, 1994] and
[Mohri and Sproat, 1996] are not guaranteed to produce a functional
transducer in the general case. Thus, we have had to develop a new, more
appropriate compilation method. The new method is described in detail in
section 4.

3.2 Complex Input Types 

In rVoice, linguistic information is internally represented by lists of feature
structures. If $o$ is an item and $f$ a feature, $f(o)$ denotes the value of
$f$ on $o$.

Rewrite operations can be applied to different levels of this model, the 
input sequences being either strings of atomic symbols (characters, phonemes, 
etc.) or sequences of items characterized by feature-value pairs. While the 
former case is straightforward, the latter requires a translation step from 
feature structures to a finite alphabet of symbols. 

This issue has been addressed in a wide range of publications. The
solutions proposed mostly guarantee a high degree of expressivity, including
feature unification. The price for the expressive power of the formalism is
non-determinism [Zajac, 1998] and/or the use of rather expensive unification
operations [Becker et al., 2002; Constant, 2003].

For efficiency reasons, we have decided to pursue a more modest approach
in the current implementation. The approach is based on the observation that
only a finite number of feature-value pairs are used in the actual rules. Since
distinctions between unseen feature-value pairs cannot affect the mechanism
of rule matching, unseen features can be ignored and the unseen values of
the seen features can be merged into a special symbol #.

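If $f_1 \ldots f_k$ are the seen features and $S_1 \ldots S_k$ the respective
sets of values appearing in the rules, then a complex input item $o$ can be
represented by the $k$-tuple $(v_1 \ldots v_k)$ such that
$v_i \in S_i \cup \{\#\}$ is defined as

$$v_i = \begin{cases} f_i(o) & \text{if } f_i(o) \in S_i \\ \# & \text{if } f_i(o) \text{ is undefined or } f_i(o) \notin S_i \end{cases}$$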
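
As an illustration of this mapping, the following Python sketch encodes an item as its $k$-tuple. The function name `encode_item`, the dictionary representation of items and the example feature inventory are assumptions, not part of the original system.

```python
OTHER = "#"  # the special symbol for "undefined or not mentioned in any rule"


def encode_item(item, seen):
    """Map a feature structure (a dict) to the k-tuple (v_1 ... v_k).

    `seen` is the ordered list of (feature, values used in the rules).
    Features that no rule mentions are simply ignored.
    """
    return tuple(
        item[f] if f in item and item[f] in values else OTHER
        for f, values in seen
    )


# Hypothetical inventory of seen features and their seen values.
seen = [("pos", {"nn", "nnp"}), ("case", {"u"}), ("type", {"alpha", "digit"})]
item = {"pos": "nn", "case": "l", "type": "alpha", "name": "lead"}
encoded = encode_item(item, seen)
```

Here the unseen value `l` of `case` collapses to `#`, and the unseen feature `name` does not appear in the tuple at all.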

The context rules are formulated as regular expressions whose leaves are item
descriptions. An item description, e.g., [pos = nn|nnp case = u], consists
of a set of feature-value descriptions (here: pos = nn|nnp and case = u),
determining a set $U_j$ of values for the respective feature $f_j$. If no
feature-value description is specified for a feature $f_j$, we set
$U_j = S_j \cup \{\#\}$. Clearly, an item $(v_1 \ldots v_k)$ matches an item
description $[U_1 \ldots U_k]$ iff $v_1 \in U_1 \ldots v_k \in U_k$.

This leads to the desired regular interpretation of feature-structure
matching rules: a concatenation of unions (disjunctions) of atomic values. If
case, pos and type are the relevant features, the last one taking values from
the set {alpha, digit}, the item description [pos = nn|nnp case = u] is
interpreted as (nn|nnp) · u · (alpha|digit|#). Clearly, this interpretation
extends to regular expressions defined over the set of item descriptions. For
example, ([pos = nn|nnp case = u])+ is interpreted as
((nn|nnp) · u · (alpha|digit|#))+.
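
The regular interpretation can be tried out with Python's `re` module. This is only a sketch under stated assumptions: each value is followed by a `;` separator so that multi-character values stay unambiguous (so values must not contain `;`), and the names `SEEN`, `description_to_regex` and `encode` are hypothetical.

```python
import re

OTHER = "#"
# Seen features in a fixed order, each with the values that occur in the rules.
SEEN = [("pos", ["nn", "nnp"]), ("case", ["u"]), ("type", ["alpha", "digit"])]


def description_to_regex(desc):
    """Translate an item description into a concatenation of unions.

    `desc` maps a feature to its admitted values U_j; a feature without a
    description admits S_j plus the symbol '#'.
    """
    parts = []
    for feature, values in SEEN:
        allowed = desc.get(feature, values + [OTHER])
        parts.append("(?:" + "|".join(re.escape(v) for v in allowed) + ");")
    return "".join(parts)


def encode(tuples):
    """Flatten a sequence of k-tuples into the string the regex runs over."""
    return "".join(value + ";" for t in tuples for value in t)


noun_upper = description_to_regex({"pos": ["nn", "nnp"], "case": ["u"]})
one_or_more = re.compile("(?:" + noun_upper + ")+")
```

A sequence of items matches the expression iff `one_or_more.fullmatch(encode(items))` succeeds.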

4 Formalisation 

4.1 The Rule Formalism 

For reasons of readability, we decided to replace the traditional rule format
($\phi \to \psi / \lambda \_ \rho$) by the equivalent notation
$\lambda / \phi / \rho \to \psi$, which we found much easier to read if
$\lambda$ and $\rho$ are complex feature structures. Thus, the compiler
expects an ordered set of rules in the following format.

$$\lambda_i / \phi_i / \rho_i \to \psi_i, \quad i = 1 \ldots n$$

$\lambda_i$ and $\rho_i$ are unrestricted regular expressions over the input
alphabet $\Sigma$. The focus $\phi_i$ is a fixed-length expression over
$\Sigma$. The right-hand side of the rule, $\psi_i$, is a (possibly empty)
sequence of symbols from the output alphabet $\Delta$.

Compared to [Kaplan and Kay, 1994] and [Mohri and Sproat, 1996], the
expressive power of the formalism is subject to two restrictions. Firstly, the
length of the focus ($\phi$) is fixed for each rule, which is a reasonable
assumption in most of the mappings being modelled. Secondly, only input symbols
are admitted in the context of a rule, which appears to be a more severe
restriction than the first one, but does not complicate the formal description
of the considered phenomena too much in practice.

4.2 Auxiliary Operations 

In this section, we define auxiliary operations for creating a rule FST. 

accept_ignoring($\beta$, $M$) This operation extends an acceptor for a pattern
$\beta$ with loops ignoring symbols in a set $M$ of markers,
$M \cap \Sigma = \emptyset$. In other words, accept_ignoring($\beta$, $M$)
accepts $w \in (\Sigma \cup M)^*$ iff $w$ can be created from a word
$u \in \Sigma^*$ that matches $\beta$ by inserting some symbols from $M$ into
$u$.

The construction of accept_ignoring($\beta$, $M$) is straightforward: after
creating a deterministic acceptor $A = (\Sigma, Q, q_0, \delta, F)$ for
$\beta$, we add the loop $\delta(q, \mu) = q$ for each $q \in Q$ and
$\mu \in M$.
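
The loop construction is short enough to sketch in Python. The `DFA` class with a dictionary transition table, the marker names `<1` and `<2`, and the acceptor for the single word `ab` are assumptions chosen for illustration.

```python
class DFA:
    """A = (Sigma, Q, q0, delta, F) with a dictionary transition table."""

    def __init__(self, q0, delta, finals):
        self.q0, self.delta, self.finals = q0, dict(delta), set(finals)

    def states(self):
        found = {self.q0} | self.finals
        for (source, _), target in self.delta.items():
            found |= {source, target}
        return found

    def accepts(self, word):
        q = self.q0
        for a in word:
            if (q, a) not in self.delta:
                return False
            q = self.delta[(q, a)]
        return q in self.finals


def accept_ignoring(dfa, markers):
    """Add the loop delta(q, mu) = q for every state q and marker mu."""
    delta = dict(dfa.delta)
    for q in dfa.states():
        for mu in markers:
            delta[(q, mu)] = q
    return DFA(dfa.q0, delta, dfa.finals)


# Hypothetical example: an acceptor for the single word "ab".
ab = DFA(0, {(0, "a"): 1, (1, "b"): 2}, {2})
ab_ignoring = accept_ignoring(ab, {"<1", "<2"})
```

Words are passed as sequences of symbols, so markers can be multi-character tokens such as `"<1"`.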



A Flexible Rule Compiler for Speech Synthesis 






accept_ignoring_nonfin($\beta$, $M$) is like accept_ignoring($\beta$, $M$)
except that it does not accept symbols from $M$ at the end of the input string.
For example, accept_ignoring_nonfin($a^*$, {#}) accepts aaaa and ##a#aa, but
not aaa###.

The construction of this FSA is similar to that of accept_ignoring($\beta$, $M$).
First, we create a deterministic acceptor $A = (\Sigma, Q, q_0, \delta, F)$ for
$\beta$. Then a loop $\delta(q, \mu) = q$ is added to $A$ for each $\mu \in M$,
$q \notin F$. Finally, for each $q \in F$:

• if $\delta(q, a)$ is defined, its target is replaced with a new non-final state $q'$;

• we add the transitions $\delta(q', \mu) := q'$ for each $\mu \in M$ and $\delta(q, \varepsilon) := q'$.



Fig. 1. Construction of accept_ignoring_nonfin($\beta$, {#}).
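
The language of accept_ignoring_nonfin can also be obtained with a simple product construction, sketched below in Python. Note that this is a substitute technique, not the state-splitting construction of Fig. 1: every state is paired with a flag recording whether the last symbol read was a marker, and only unflagged states may be final. All names are hypothetical.

```python
def accept_ignoring_nonfin(q0, delta, finals, markers):
    """Return (q0', delta', finals') over states (q, last_symbol_was_marker)."""
    states = {q0} | set(finals)
    for (source, _), target in delta.items():
        states |= {source, target}
    new_delta = {}
    for q in states:
        for flag in (False, True):
            for mu in markers:
                new_delta[((q, flag), mu)] = (q, True)    # ignore the marker
    for (q, a), target in delta.items():
        for flag in (False, True):
            new_delta[((q, flag), a)] = (target, False)   # ordinary symbol
    return (q0, False), new_delta, {(q, False) for q in finals}


def accepts(q0, delta, finals, word):
    q = q0
    for a in word:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals


# Acceptor for a* (one state, initial and final), as in the example above.
A_STAR = (0, {(0, "a"): 0}, {0})
nonfin = accept_ignoring_nonfin(*A_STAR, markers={"#"})
```

The resulting automaton is deterministic by construction and has at most twice as many states as the original acceptor.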



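replace($\beta$, $\gamma$) translates a regular expression $\beta$ into a string
$\gamma$. It is constructed by turning an acceptor
$A = (\Sigma, Q, q_0, \delta, F)$ for $\beta$ into a transducer
$T = (\Sigma, \Delta, Q \cup \{q_f\}, q_0, \delta', \sigma, \{q_f\})$ such that
$q_f$ is a new final state, $\sigma(q, a) := \varepsilon$ for each
$(q, a) \in Dom(\delta)$, $\delta \subset \delta'$,
$\delta'(q, \varepsilon) := q_f$ and $\sigma(q, \varepsilon) := \gamma$ for each
$q \in F$.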

mark_regex$_\Sigma$($\beta$, $\mu$) This operation inserts a symbol $\mu$ after
each occurrence of a pattern $\beta$. It is identical to the type 1 marker
transducer defined in [Mohri and Sproat, 1996]. It can be constructed from a
deterministic acceptor $A = (\Sigma, Q, q_0, \delta, F)$ for the pattern
$\Sigma^* \beta$ in the following way: first, an identity transducer
$Id(A) = (\Sigma, \Sigma, Q, q_0, \delta, \sigma, F)$ is created such that
$\sigma(q, a) = a$ whenever $\delta(q, a)$ is defined. By construction, $Id(A)$
is deterministic.

Then, $T = (\Sigma, \Sigma \cup \{\mu\}, Q \cup F', q_0, \delta', \sigma', (Q \cup F') \setminus F)$
is created such that

$F' := \{q' : q \in F\}$ (a copy of each final state of $Id(A)$)
$\delta'(q, a) = \delta(q, a)$, $\sigma'(q, a) = \sigma(q, a)$ for $q \notin F$, $a \in \Sigma$
$\delta'(q', a) = \delta(q, a)$, $\sigma'(q', a) = \sigma(q, a)$ for $q \in F$, $a \in \Sigma$
$\delta'(q, \varepsilon) = q'$, $\sigma'(q, \varepsilon) = \mu$ for $q \in F$

Informally, the construction of $T$ consists in swapping the final and
non-final states of $Id(A)$ and splitting each final state $q$ of $A$ in two
states $q$ and $q'$ such that all transitions $t$ leaving $q$ in $A$ leave $q'$
in $T$. The two states are then connected by a transition
$(q, q', \varepsilon, \mu)$, as shown in figure 2.

Fig. 2. Construction of the mark_regex FST inserting < after each match of $\beta$.
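
The effect of the mark_regex transducer can be simulated directly on a deterministic acceptor for $\Sigma^* \beta$: the marker is emitted whenever the acceptor is in a final state. The Python sketch below does exactly that instead of building the split states explicitly; the function signature and the hand-built acceptor for $\Sigma^* ab$ over $\Sigma = \{a, b\}$ are assumptions.

```python
def mark_regex(q0, delta, finals, marker, word):
    """Copy `word` and insert `marker` after every prefix matching Sigma* beta.

    (q0, delta, finals) must be a complete deterministic acceptor for
    Sigma* beta. Emitting the marker in a final state corresponds to the
    epsilon transition (q, q', epsilon, mu) of the construction.
    """
    out, q = [], q0
    if q in finals:              # beta matches the empty prefix
        out.append(marker)
    for a in word:
        q = delta[(q, a)]
        out.append(a)
        if q in finals:
            out.append(marker)
    return out


# Hand-built acceptor for Sigma* ab; the state records progress towards "ab".
SIGMA_STAR_AB = (
    0,
    {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2, (2, "a"): 1, (2, "b"): 0},
    {2},
)
marked = mark_regex(*SIGMA_STAR_AB, "<", "abaab")
```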

left_context_filter$_\Sigma$($\beta$, $\mu$) This operation deletes all
occurrences of a symbol $\mu$ in a string $s \in (\Sigma \cup \{\mu\})^*$ that
are not preceded by an instance of pattern $\beta$. A transducer performing
this operation can be constructed from a deterministic acceptor
$A = (\Sigma, Q, q_0, \delta, F)$ for the pattern $\Sigma^* \beta$ by creating
an identity transducer $Id(A) = (\Sigma, \Sigma, Q, q_0, \delta, \sigma, F)$
and then turning it into a transducer
$T = (\Sigma \cup \{\mu\}, \Sigma \cup \{\mu\}, Q, q_0, \delta', \sigma', Q)$
such that:

$\delta'(q, a) = \delta(q, a)$, $\sigma'(q, a) = \sigma(q, a)$ for $(q, a) \in Dom(\delta)$
$\delta'(q, \mu) = q$ for $q \in Q$
$\sigma'(q, \mu) = \mu$ for $q \in F$ (copying of $\mu$ into the output after a match of $\beta$)
$\sigma'(q, \mu) = \varepsilon$ for $q \notin F$ (deletion of $\mu$ after a string that does not match $\beta$)



Fig. 3. The left_context_filter FST deleting < if it is not preceded by $\beta$.
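
The filter can be sketched in the same simulation style: the acceptor for $\Sigma^* \beta$ is advanced on ordinary symbols only, and each marker is copied or dropped depending on whether the current state is final. The function name, argument order and the example acceptor for $\Sigma^* a$ are assumptions.

```python
def left_context_filter(q0, delta, finals, marker, word):
    """Delete every `marker` that is not preceded by a match of beta.

    (q0, delta, finals) is a complete deterministic acceptor for Sigma* beta.
    """
    out, q = [], q0
    for a in word:
        if a == marker:
            if q in finals:
                out.append(a)    # sigma'(q, mu) = mu for q in F
            # otherwise sigma'(q, mu) = epsilon; the state is unchanged
        else:
            q = delta[(q, a)]
            out.append(a)        # identity on ordinary symbols
    return out


# Hand-built acceptor for Sigma* a over Sigma = {a, b}.
SIGMA_STAR_A = (0, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}, {1})
filtered = left_context_filter(*SIGMA_STAR_A, "<", ["<", "a", "<", "b", "<"])
```

Only the marker that directly follows `a` survives; the other two are deleted.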



4.3 Constructing a Rule FST 

Each rule is compiled into a composition of two FSTs. The first one inserts
the symbol $<_i$ before each match of
$\phi_i \cdot$ accept_ignoring($\rho_i$, Markers$_{<i}$), where Markers$_{<i}$
is the set of all markers $<_j$, $j < i$. The second transducer deletes all
occurrences of $<_i$ that are not preceded by an instance of the left context
pattern $\lambda_i$, possibly interspersed with markers inserted by previous
rules ($<_j$). The resulting translation is the original string with the
marker $<_i$ inserted at all positions where rule $R_i$ fires.

Both FSTs are obviously functional, and so is their composition. 

Marking of the Right Context and Focus Match. The first transducer,
pre_mark, inserts a left focus marker ($<_i$) before each match of
$\phi_i \cdot$ accept_ignoring($\rho_i$, Markers$_{<i}$). It is right-to-left
deterministic and can be created by composing the following operations:

$$\text{pre\_mark}_i = \text{mark\_regex}_{\Sigma \cup \text{Markers}_{<i}}\big([\phi_i \cdot \text{accept\_ignoring}(\rho_i, \text{Markers}_{<i})]^{-1}, <_i\big)^{-1}$$

Note that the mark_regex operation is performed relative to the extended
alphabet $\Sigma \cup$ Markers$_{<i}$ as the input string may already contain
markers inserted by an earlier rule.

Checking the Left Context. The task of the second FST, check_left_cxt,
is to remove all occurrences of $<_i$ that are not preceded by an instance of
$\lambda_i$. Note that the substring matching $\lambda_i$ may contain some of
the markers $<_1, \ldots, <_{i-1}$, therefore the left_context_filter operation
is performed relative to the extended alphabet
$\Sigma \cup \text{Markers}_{<i} = \Sigma \cup \{<_1, \ldots, <_{i-1}\}$.

$$\text{check\_left\_cxt}_i = \text{left\_context\_filter}_{\Sigma \cup \text{Markers}_{<i}}\big(\text{accept\_ignoring}(\lambda_i, \text{Markers}_{<i}), <_i\big)$$

Composition of Rule Transducers. The transducer for rule $R_i$ is the
result of the composition: $r_i := \text{pre\_mark}_i \circ \text{check\_left\_cxt}_i$.

Since both transducers are deterministic (hence functional), the result of
their composition is functional, too. The application of the rules
$R_1, \ldots, R_n$ to a string $s$ is then modelled by the composition of FSTs:
$(r_1 \circ r_2 \circ \ldots \circ r_n \circ \text{rewrite})(s)$. rewrite is a
simple FST that, having read a marker symbol $<_i$, leaves the initial state
and jumps to a subnetwork translating $\phi_i$ to $\psi_i$ (ignoring markers).
When the translation is finished, rewrite returns to its initial state.
rewrite can be constructed as the closure of the union of transducers
rewrite_rule_focus$_i$, $i = 1 \ldots n$, defined as (see footnote 1):

$$\text{rewrite\_rule\_focus}_i := \text{replace}(<_i \cdot\ \text{accept\_ignoring\_nonfin}(\phi_i, \text{Markers}_{>i}), \psi_i)$$

Note the use of accept_ignoring_nonfin rather than just accept_ignoring. This
guarantees that the transducer will not consume any markers following the
last character of $\phi_i$ (these markers indicate the next rule application).
The transducer rewrite is then defined as:

$$\text{rewrite} := \Big(\bigcup_{i=1}^{n} \text{rewrite\_rule\_focus}_i\Big)^*$$

Clearly, accept_ignoring_nonfin is determinizable, and the resulting transducer
rewrite is deterministic. With $r_1 \ldots r_n$ being functional, it follows
that the rational relation
$r_1 \circ r_2 \circ \ldots \circ r_n \circ \text{rewrite}$ is functional.
Therefore, the result of the compilation is a functional FST that is either
determinizable or can be factorized into a bimachine (see footnote 2).

1 We assume that at least one marker will be inserted at each position in the
input string. This can be achieved by specifying a default rule
$/\mu/ \to \gamma$ for each $\mu \in \Sigma$.
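
The marker strings produced by $r_1 \circ \ldots \circ r_n$ can be reproduced with a direct simulation in Python. This sketch does not build any transducer: it restricts the focus to a literal string, writes the contexts as Python regular expressions over single-character symbols, and uses the hypothetical function name `mark_up`. A rule does not fire when an earlier marker falls inside its focus, because only the contexts are wrapped in accept_ignoring.

```python
import re


def mark_up(rules, s):
    """Simulate r_1 o ... o r_n on the string s.

    `rules` is an ordered list of (left_regex, focus, right_regex).
    Returns a token list in which '<i' precedes every position where
    rule i fires.
    """
    marks = [[] for _ in range(len(s) + 1)]   # markers sitting before position k
    for i, (left, focus, right) in enumerate(rules, start=1):
        m = len(focus)
        fired = []
        for k in range(len(s) - m + 1):
            if s[k:k + m] != focus:
                continue
            if any(marks[j] for j in range(k + 1, k + m)):
                continue                      # earlier marker inside the focus
            if not re.match("(?:" + right + ")", s[k + m:]):
                continue                      # right context fails
            if not re.search("(?:" + left + r")\Z", s[:k]):
                continue                      # left context fails
            fired.append(k)
        for k in fired:
            marks[k].append("<%d" % i)
    tokens = []
    for k, symbol in enumerate(s):
        tokens.extend(marks[k])
        tokens.append(symbol)
    return tokens


# The two rules of footnote 2: R1 = /ab/ -> X and R2 = /b/ -> Y.
footnote_example = mark_up([("", "ab", ""), ("", "b", "")], "ab")
```

With the rule order reversed, the marker of the rule for `b` sits inside the focus `ab`, so the rule for `ab` no longer fires.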

5 Applications 

rVoice is implemented as a pipeline of modules that successively transform 
the input string into sound. The text processing modules create a sequence 
of segments (phones and pauses) organized into syllables, words and phrases. 
The result is passed to the speech modules that generate the actual speech 
signal. At each level, linguistic information is represented by a heterogeneous 
relation graph [Taylor et al., 2001], typically a list of feature structures.

Each module creates a new relation or adds information to the existing 
ones. The tokenizer splits the input string into a list of tokens. The text nor- 
malisation module expands abbreviations, numbers, dates, etc., creating a 
list of words, each one annotated with a normalised word form. Further mod- 
ules (part-of-speech/homograph tagger, reduction module, language identifi- 
cation) set features such as part-of-speech on the words. 

The lexicon module tries to find a phonetic transcription for the normal- 
ized word form that is consistent with the features set on it. If it fails, the 
transcription is generated by letter-to-sound rules. 

In order to accommodate the requirements of different TTS modules, our
rule compiler is parameterizable with respect to input types and emissions.
Two specific instantiations have been employed so far. The first one is the
conversion of a string of atomic symbols from an input alphabet $\Sigma$ into a
string of symbols from an output alphabet $\Delta$. The second application is
setting features on a list of complex objects (relation items). In the remainder
of this section, we illustrate each of the two scenarios with an application.

5.1 Grapheme-to-Phoneme Conversion

The case of grapheme-to-phoneme conversion is straightforward. The input
alphabet $\Sigma$ comprises all alphabetic characters, while the output
alphabet $\Delta$ is the set of all legal phonetic symbols of the language
under consideration.
For each character, we need to write rules describing how this character can 
be pronounced. If more than one pronunciation is possible, each variant is 
covered by a rule. The ordering of the rules determines how conflicts between 
rules are resolved and makes it possible to write simple default rules. 

2 Note that if the focus of a rule contains more than one character, rules with
lesser priority may insert markers into the matched string. For example, the
rules $R_1$ : /ab/ → X and $R_2$ : /b/ → Y will mark up the string ab as
$<_1$ a $<_2$ b, but, in accordance with the operational semantics of the
compiler, the second marker will be ignored by the rewrite transducer when the
match is rewritten as X.

The following rules describe the pronunciation of 'c' in American Spanish:

/ s c / (e|i) -> s ;    # ascienda -> [asienda]
/ c / (e|i) -> s ;      # cenar -> [senar]
/ c h / -> ch ;         # ocho -> [ o ch o ]
/ c / -> k ;            # default rule: k



Such hand-written rules are used for languages that have a very regular 
orthography, such as Spanish, which is covered by 110 rules, including stress 
assignment. The resulting FST has 119 states and 5160 transitions. 
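
To check how rule ordering resolves the conflicts among the four rules for 'c', they can be run through a small direct interpreter. This is a sketch, not the compiled FST: foci are literal strings, contexts are Python regular expressions, the names `SPANISH_C` and `transcribe` are hypothetical, and letters not covered by any rule are copied unchanged as a stand-in for the default rules of the full rule set.

```python
import re

# (left context, focus, right context, output symbols), in priority order.
SPANISH_C = [
    ("", "sc", "e|i", ["s"]),
    ("", "c", "e|i", ["s"]),
    ("", "ch", "", ["ch"]),
    ("", "c", "", ["k"]),
]


def transcribe(rules, word):
    n = len(word)
    winner = [None] * n   # (focus length, output) of the best rule firing at k
    for left, focus, right, out in rules:
        m = len(focus)
        fired = [
            k for k in range(n - m + 1)
            if word[k:k + m] == focus
            # blocked if a higher-priority rule fires inside the focus
            and not any(winner[j] is not None for j in range(k + 1, k + m))
            and re.match("(?:%s)" % right, word[k + m:])
            and re.search(r"(?:%s)\Z" % left, word[:k])
        ]
        for k in fired:
            if winner[k] is None:          # lower rule index has priority
                winner[k] = (m, out)
    result, k = [], 0
    while k < n:
        if winner[k] is None:
            result.append(word[k])         # assumed identity default rule
            k += 1
        else:
            m, out = winner[k]
            result.extend(out)
            k += m                         # markers inside the focus are skipped
    return result
```

For instance, `transcribe(SPANISH_C, "ocho")` yields the three symbols `o`, `ch`, `o`, because the rule for `ch` precedes the default rule for `c`.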

5.2 Homograph Disambiguation 

In rVoice, homograph disambiguation is the result of an interaction between 
several modules. First of all, a statistical part-of-speech tagger determines the 
part-of-speech of each word in an utterance. This information is useful, but 
not always sufficient for determining the right pronunciation. For instance, 
both pronunciation variants of lead are compatible with the POS noun, as
in the sentences "Lakeview took a 14-0 lead in the second quarter" and "There's
very high lead levels in your water". Furthermore, the POS tagger may be
consistently inaccurate in certain contexts, in which case its predictions are 
overridden by hand-written homograph rules. The rules refer directly to the 
sense IDs associated with the pronunciation variants of the word in question, 
as in the following rules, which disambiguate between the different senses of suspects:

[name=that] / [name=suspects] / -> [sense=2] ;
( [pos=dt|cd] | [name=terror] ) / [name=suspects] / -> [sense=1] ;
/ [name=suspects] / [name=that] -> [sense=2] ;
/ [name=suspects] / -> [sense=1] ; # default rule
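
The four rules can be exercised with a direct interpreter over lists of items, sketched below in Python. It is an illustration only: contexts are limited to a single adjacent item given as a list of alternative descriptions, the names `RULES`, `matches` and `apply_rules` are hypothetical, and the POS tags in the sample sentence are assumed values rather than tagger output.

```python
def matches(item, desc):
    """An item matches a description iff every described feature has an admitted value."""
    return all(item.get(f) in values for f, values in desc.items())


def adjacent_ok(items, idx, alternatives):
    """Check a one-item context; an empty list means 'no constraint'."""
    if not alternatives:
        return True
    return 0 <= idx < len(items) and any(matches(items[idx], d) for d in alternatives)


# (left context, focus, right context, features to set), in priority order.
RULES = [
    ([{"name": {"that"}}], {"name": {"suspects"}}, [], {"sense": 2}),
    ([{"pos": {"dt", "cd"}}, {"name": {"terror"}}], {"name": {"suspects"}}, [], {"sense": 1}),
    ([], {"name": {"suspects"}}, [{"name": {"that"}}], {"sense": 2}),
    ([], {"name": {"suspects"}}, [], {"sense": 1}),
]


def apply_rules(rules, items):
    for k, item in enumerate(items):
        for left, focus, right, action in rules:
            if (matches(item, focus)
                    and adjacent_ok(items, k - 1, left)
                    and adjacent_ok(items, k + 1, right)):
                item.update(action)   # the first matching rule wins
                break
    return items


# "the terror suspects that were in court", with assumed POS tags.
sentence = apply_rules(RULES, [
    {"name": "the", "pos": "dt"}, {"name": "terror", "pos": "nn"},
    {"name": "suspects", "pos": "nns"}, {"name": "that", "pos": "wdt"},
    {"name": "were", "pos": "vbd"}, {"name": "in", "pos": "in"},
    {"name": "court", "pos": "nn"},
])
```

In this sentence both the second and the third rule match the item for suspects, and the second one is applied because it comes first.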

To explain how the rules interact, we will look at the following example:

the_1 terror_2 suspects_3 that_4 were_5 in_6 court_7

We can see that the second and the third rule match the context of word 3.
The action associated with the lower rule index is chosen, resulting in the
value of sense being set to 1 on the item.

According to the compilation method described in section 3.2, a sequence
of items is translated into a sequence of relevant feature values. The compiled
rule FST rewrites this sequence as a sequence of features to be set according
to the right-hand-side of the rules (in this case, it is the feature sense).

6 Conclusions

By using FSTs, we have achieved a uniform and declarative way of expressing
linguistic knowledge in rVoice. The rule compilers are run off-line for each
FST-based module, producing a DFST encoding the combined rules used by
this particular module. The FST is loaded by the system at runtime. Thus, it
has been possible to achieve a clear separation of the language-independent
processing algorithms and the language-, accent- or speaker-specific data (the
FSTs). The (minimized and determinized) FSTs have contributed to a
significant speedup and footprint reduction.

The interaction of the rule-based FSTs and the automatically trained text 
modules (POS tagger, language identification) reflects the strengths of both 
approaches. The latter components, trained on newspaper text, guarantee a 
high accuracy baseline on input similar to the available training material. In 
particular, the POS accuracy is over 96% on news text, while the accuracy 
of language identification exceeds 99% (both measured per token). The rule- 
based modules are typically used to correct or to complement the predictions 
of the automatically trained modules, for example on untypical text genres, 
or in response to specific customer requirements. 

The FST-based rVoice modules comprise homograph disambiguation,
post-lexical reductions, grapheme-to-phoneme conversion and syllabification.
In all these applications, the compiler has proved to be a flexible and useful
component of the system.